Merge RedistributeCPU and RedistributeGPU into one implementation #5268
atmyers merged 49 commits into AMReX-Codes:development from
Conversation
GitLab CI 1529965 finished with status: success. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1529965.
/run-hpsf-gitlab-ci
GitLab CI has started at https://gitlab.spack.io/amrex/amrex/-/pipelines/1532660.
GitLab CI 1532660 finished with status: success. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1532660.
Note: the regression tests for AMReX apps on gaira, garuda, and biollante look good, so long as these PRs are merged along with this one:
I also confirmed that GPU performance has not regressed, despite doing a little more work to support tiling. (Timing comparisons for this PR vs. development were attached as images.)
I have added back in an assertion that tiling is off on the GPU if neighbor particles are used. I will remove this limitation and add a better neighbor particles test in a follow-up PR.
PR #5268 allowed particle tiling on the GPU in Redistribute, but did not extend this support to neighbor particles. This PR adds that support. It also extends the existing test to exercise tiling and to check more carefully that the ghosted particle data is exactly right. Note that this test now reproduces all the particles on all ranks and is therefore not suitable for running on many ranks.

The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate
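The idea behind the strengthened test — replicating all particles on all ranks so ghosted copies can be checked exactly — can be sketched as follows. This is an illustrative standalone sketch, not the AMReX test code; the `TestParticle` struct and `ghosts_match_reference` function are hypothetical names introduced here.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for an AMReX particle: a unique id
// plus a position.
struct TestParticle {
    long id;
    double x, y, z;
};

// Check that every ghost (neighbor) particle is an exact copy of a
// particle in a globally replicated reference set. Because the reference
// holds every particle on every rank, the comparison can be exact rather
// than statistical -- which is also why the approach does not scale to
// many ranks.
bool ghosts_match_reference(const std::vector<TestParticle>& all_particles,
                            const std::vector<TestParticle>& ghosts)
{
    std::unordered_map<long, const TestParticle*> by_id;
    for (const auto& p : all_particles) { by_id[p.id] = &p; }

    for (const auto& g : ghosts) {
        auto it = by_id.find(g.id);
        if (it == by_id.end()) { return false; }  // ghost with unknown id
        const TestParticle& ref = *it->second;
        // Require bitwise-equal positions: ghost copies must not drift.
        if (g.x != ref.x || g.y != ref.y || g.z != ref.z) { return false; }
    }
    return true;
}
```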
This merges RedistributeCPU and RedistributeGPU into one shared implementation that works for both, improving the maintainability of the code base. It also restructures the way OpenMP parallelism works in Redistribute on the CPU, resulting in better OpenMP performance and scaling. Another consequence is that particle tiling is now supported on GPU (although probably not desirable in most cases).
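The shape of such a merged implementation can be sketched in plain C++: write the redistribute body once against a generic `ParallelFor`-style launcher, and let the backend decide whether the loop runs on host threads or the device. Everything below is an illustrative sketch under that assumption — `ParallelForHost` and `redistribute_1d` are hypothetical names, not the AMReX API.

```cpp
#include <vector>

// Stand-in for a ParallelFor launcher: on CPU this is a plain loop (or an
// OpenMP loop over tiles); a GPU backend would launch the same lambda as
// a device kernel. The kernel body is written once for both.
template <typename F>
void ParallelForHost(int n, F&& f)
{
    for (int i = 0; i < n; ++i) { f(i); }
}

// One shared redistribute body: flag particles that remain inside the
// local domain [lo, hi), then compact the kept ones. In the real code the
// flagged "movers" would be shipped to their owning rank/tile; here they
// are simply dropped to keep the sketch self-contained.
std::vector<double> redistribute_1d(const std::vector<double>& pos,
                                    double lo, double hi)
{
    const int n = static_cast<int>(pos.size());
    std::vector<char> keep(n);
    ParallelForHost(n, [&](int i) {
        keep[i] = (pos[i] >= lo && pos[i] < hi);
    });
    std::vector<double> out;
    for (int i = 0; i < n; ++i) {
        if (keep[i]) { out.push_back(pos[i]); }
    }
    return out;
}
```

The maintainability win comes from having a single kernel body: a fix or optimization made once applies to both backends.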
Performance results on a redistribute benchmark with 2 MPI ranks on a Perlmutter CPU node, as a function of the number of OpenMP threads ("run" is this branch, "dev" is development): the new version is always an improvement and, at high thread counts, is roughly 2x faster or more.
When compiled for CPU with USE_OMP=FALSE, the new implementation is about 25% faster on that same benchmark, mostly owing to a new early exit in the partition step. The difference is more dramatic in cases with more particles per cell, like this example from WarpX.
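The early-exit idea can be illustrated with a minimal sketch (simplified, not the AMReX source): before partitioning particles into "stays here" / "must move", first scan for the first misplaced element. When none is found — the common case when few particles cross box boundaries per step — the partition and subsequent permutation work are skipped entirely.

```cpp
#include <algorithm>
#include <vector>

// Partition v so elements satisfying `stays` come first, returning the
// count of such elements, but exit early if v is already partitioned.
// The early scan is a cheap read-only pass; the swap-heavy
// std::partition runs only from the first misplaced element onward.
template <typename Pred>
int partition_with_early_exit(std::vector<int>& v, Pred stays)
{
    auto first_bad = std::find_if_not(v.begin(), v.end(), stays);
    if (first_bad == v.end()) {
        // Already partitioned: nothing moved, no permutation needed.
        return static_cast<int>(v.size());
    }
    auto mid = std::partition(first_bad, v.end(), stays);
    return static_cast<int>(mid - v.begin());
}
```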